Abstract

Abc-boost is a new line of boosting algorithms for multi-class classification that utilizes the
commonly used sum-to-zero constraint. To implement abc-boost, a base class must be identified at each
boosting step. Prior studies used a very expensive procedure based on exhaustive search for determining
the base class at each boosting step. Good testing performance of abc-boost (implemented as abc-mart
and abc-logitboost) on a variety of datasets was reported.

For large datasets, however, the exhaustive search strategy adopted in prior abc-boost algorithms can
be prohibitive. To overcome this serious limitation, this paper suggests a heuristic by introducing
Gaps when computing the base class during training. That is, we update the choice of the base class only
once every G boosting steps (i.e., G = 1 in prior studies). We test this idea on large datasets (Covertype
and Poker) as well as datasets of moderate size. Our preliminary results are very encouraging. On the
large datasets, when G < 100 (or even larger), there is essentially no loss of test accuracy compared to
using G = 1. On the moderate datasets, no obvious loss of test accuracy is observed when G < 20 ~ 50.
Therefore, aided by this heuristic of using gaps, it is promising that abc-boost will be a practical tool for
accurate multi-class classification.

1 Introduction 

This study focuses on significantly improving the computational efficiency of abc-boost, a new line of
boosting algorithms recently proposed for multi-class classification [8, 9]. Boosting (see, e.g., [3, 4, 6])
has been successful in machine learning and industry practice.

In prior studies, abc-boost has been implemented as abc-mart [8] and abc-logitboost [9]. Therefore, for
completeness, we first provide a review of logitboost [6] and mart (multiple additive regression trees) [5].



1.1 Data Probability Model and Loss Function 

We denote a training dataset by $\{y_i, \mathbf{x}_i\}_{i=1}^{N}$, where $N$ is the number of feature vectors (samples), $\mathbf{x}_i$ is the $i$th
feature vector, and $y_i \in \{0, 1, 2, ..., K-1\}$ is the $i$th class label, where $K \geq 3$ in multi-class classification.

Both logitboost [6] and mart [5] can be viewed as generalizations of the classical logistic regression,
which models class probabilities as

$$p_{i,k} = \Pr(y_i = k \mid \mathbf{x}_i) = \frac{e^{F_{i,k}(\mathbf{x}_i)}}{\sum_{s=0}^{K-1} e^{F_{i,s}(\mathbf{x}_i)}}. \qquad (1)$$
While logistic regression simply assumes $F_{i,k}(\mathbf{x}_i) = \beta_k^{T}\mathbf{x}_i$, logitboost and mart adopt the flexible "additive
model," which is a function of $M$ terms:

$$F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m), \qquad (2)$$

where $h(\mathbf{x}; \mathbf{a}_m)$, the base (weak) learner, is typically a regression tree. The parameters, $\rho_m$ and $\mathbf{a}_m$, are
learned from the data by maximizing the joint likelihood, which is equivalent to minimizing the following
negative log-likelihood loss function:

$$L = \sum_{i=1}^{N} L_i, \qquad L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k}, \qquad (3)$$

where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise. For identifiability, $\sum_{k=0}^{K-1} F_{i,k} = 0$, i.e., the sum-to-zero
constraint, is typically adopted (see, e.g., [6, 13, 15]).
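
As a small illustration of (1) and (3), the following Python sketch computes the class probabilities of one sample from its function values and then the total loss. The function names `class_probs` and `neg_log_likelihood` and the toy numbers are ours, chosen only for illustration; subtracting the maximum before exponentiating is a standard numerical precaution and is not taken from the text.

```python
import math

def class_probs(F_i):
    """Eq. (1): class probabilities from the K function values of one sample."""
    m = max(F_i)                          # shift by the max for numerical stability
    e = [math.exp(f - m) for f in F_i]
    total = sum(e)
    return [v / total for v in e]

def neg_log_likelihood(F, y):
    """Eq. (3): since r_{i,k} is 1 only for k = y_i, L_i = -log p_{i, y_i}."""
    return -sum(math.log(class_probs(F_i)[y_i]) for F_i, y_i in zip(F, y))

# Toy example: N = 2 samples, K = 3 classes, function values that sum to zero.
F = [[1.0, -0.5, -0.5], [0.0, 0.0, 0.0]]
y = [0, 2]
loss = neg_log_likelihood(F, y)
```

With all function values equal to zero, as at the start of boosting, every class receives probability 1/K.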

1.2 The (Robust) Logitboost and Mart Algorithms 

The logitboost algorithm [6] builds the additive model (2) by a greedy stage-wise procedure, using a
second-order (diagonal) approximation of the loss function (3). The standard practice is to implement logitboost
using regression trees. The mart algorithm [5] is a creative combination of gradient descent and Newton's
method, by using the first-order information of the loss function (3) to construct the trees and using both the
first- and second-order derivatives to determine the values of the terminal nodes.

Therefore, both logitboost and mart require the first two derivatives of the loss function (3) with
respect to the function values $F_{i,k}$. [6, 5] used the following derivatives:

$$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k}), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k}). \qquad (4)$$

The recent work named robust logitboost [9] is a numerically stable implementation of logitboost. [9]
unified logitboost and mart by showing that their difference lies in the tree-split criterion for constructing
the regression trees at each boosting iteration.

1.2.1 Tree-Split Criteria for (Robust) Logitboost and Mart 

Consider $N$ weights $w_i$ and $N$ response values $z_i$, $i = 1$ to $N$, which are assumed to be ordered according
to the ascending order of the corresponding feature values. The tree-split procedure is to find the index $s$,
$1 \leq s < N$, such that the weighted square error (SE) is reduced the most if split at $s$. That is, we seek the $s$
to maximize the gain:

$$Gain(s) = SE_T - (SE_L + SE_R) = \sum_{i=1}^{N} (z_i - \bar{z})^2 w_i - \left[ \sum_{i=1}^{s} (z_i - \bar{z}_L)^2 w_i + \sum_{i=s+1}^{N} (z_i - \bar{z}_R)^2 w_i \right], \qquad (5)$$

where

$$\bar{z} = \frac{\sum_{i=1}^{N} z_i w_i}{\sum_{i=1}^{N} w_i}, \qquad \bar{z}_L = \frac{\sum_{i=1}^{s} z_i w_i}{\sum_{i=1}^{s} w_i}, \qquad \bar{z}_R = \frac{\sum_{i=s+1}^{N} z_i w_i}{\sum_{i=s+1}^{N} w_i}.$$






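[9] showed the expression (5) can be simplified to be

$$Gain(s) = \frac{\left[\sum_{i=1}^{s} z_i w_i\right]^2}{\sum_{i=1}^{s} w_i} + \frac{\left[\sum_{i=s+1}^{N} z_i w_i\right]^2}{\sum_{i=s+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} z_i w_i\right]^2}{\sum_{i=1}^{N} w_i}. \qquad (6)$$

For logitboost, [6] used the weights $w_i = p_{i,k}(1 - p_{i,k})$ and the responses $z_i = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$, i.e.,

$$LogitGain(s) = \frac{\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{s} p_{i,k}(1 - p_{i,k})} + \frac{\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=s+1}^{N} p_{i,k}(1 - p_{i,k})} - \frac{\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{N} p_{i,k}(1 - p_{i,k})}. \qquad (7)$$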



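For mart, [5] used the weights $w_i = 1$ and the responses $z_i = r_{i,k} - p_{i,k}$, i.e.,

$$MartGain(s) = \frac{1}{s}\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2 + \frac{1}{N-s}\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2 - \frac{1}{N}\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2. \qquad (8)$$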
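
The equivalence of (5) and (6) can be checked numerically. The sketch below is a minimal pure-Python illustration, not the implementation used in the experiments: `gain_direct` evaluates (5), `gain_simplified` evaluates (6), and `best_split` scans all split positions. The function names and the toy responses are our own assumptions; with unit weights the gain is exactly MartGain (8).

```python
def gain_direct(z, w, s):
    """Eq. (5): SE_T - (SE_L + SE_R) when splitting after the first s items."""
    def se(zs, ws):
        mean = sum(a * b for a, b in zip(zs, ws)) / sum(ws)
        return sum((a - mean) ** 2 * b for a, b in zip(zs, ws))
    return se(z, w) - (se(z[:s], w[:s]) + se(z[s:], w[s:]))

def gain_simplified(z, w, s):
    """Eq. (6): the same gain written with weighted sums only."""
    def term(zs, ws):
        return sum(a * b for a, b in zip(zs, ws)) ** 2 / sum(ws)
    return term(z[:s], w[:s]) + term(z[s:], w[s:]) - term(z, w)

def best_split(z, w):
    """Index s with 1 <= s < N that maximizes the gain."""
    return max(range(1, len(z)), key=lambda s: gain_simplified(z, w, s))

# Toy responses, assumed already sorted by the feature value; unit weights as in mart.
z = [-0.9, -0.8, -0.7, 0.6, 0.9]
w = [1.0] * len(z)
s_star = best_split(z, w)
```

For this toy input the best split separates the three negative responses from the two positive ones. The form (6) needs only running sums of $z_i w_i$ and $w_i$, so in practice all split positions can be scored in one pass over the sorted data.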



1.2.2 The Robust Logitboost Algorithm 



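Algorithm 1 Robust logitboost, which is very similar to the mart algorithm [5], except for Line 4.

1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K - 1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K - 1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{r_{i,k} - p_{i,k}, \mathbf{x}_i\}_{i=1}^{N}$, with weights $p_{i,k}(1 - p_{i,k})$ as in (7)
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
9: End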



Alg. 1 describes robust logitboost using the tree-split criterion (7). In Line 6, $\nu$ is the shrinkage parameter
and is normally set to be $\nu \leq 0.1$. Note that after trees are constructed, the values of the terminal nodes are
computed by

$$\frac{\sum_{node} z_{i,k} w_{i,k}}{\sum_{node} w_{i,k}} = \frac{\sum_{node} (r_{i,k} - p_{i,k})}{\sum_{node} p_{i,k}(1 - p_{i,k})}, \qquad (9)$$

which explains Line 5 of Alg. 1.
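
A minimal sketch of the terminal-node value follows, assuming the samples falling into one node are given as two parallel lists. The name `node_value` and the optional argument `K` (which applies the $(K-1)/K$ factor of Line 5) are ours. The sketch contains no safeguard for a vanishing denominator, which a numerically robust implementation would need.

```python
def node_value(r, p, K=None):
    """Weighted average of the responses in one terminal node.

    r, p : indicators r_{i,k} and probabilities p_{i,k} of the samples in the node.
    K    : if given, multiply by (K - 1) / K as in Line 5 of Alg. 1.
    """
    numerator = sum(ri - pi for ri, pi in zip(r, p))
    denominator = sum(pi * (1.0 - pi) for pi in p)   # assumed non-zero in this sketch
    value = numerator / denominator
    return value if K is None else (K - 1) / K * value

# Three samples in a node, each with current probability 0.5 for class k.
plain = node_value([1, 0, 1], [0.5, 0.5, 0.5])
scaled = node_value([1, 0, 1], [0.5, 0.5, 0.5], K=3)
```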



1.2.3 The Mart Algorithm 

The mart algorithm only uses the first derivative to construct the tree. Once the tree is constructed, [5]
applied a one-step Newton update to obtain the values of the terminal nodes. Interestingly, this one-step
Newton update yields exactly the same equation as (9). In other words, (9) is interpreted as a weighted
average in logitboost but it is interpreted as the one-step Newton update in mart. Thus, the mart algorithm
is similar to Alg. 1; we only need to change Line 4, by replacing (7) with (8).
2 Review Adaptive Base Class Boost (ABC-Boost)

Developed by [8], the abc-boost algorithm consists of the following two components:

1. Using the widely-used sum-to-zero constraint (see, e.g., [6, 13, 14, 15]) on the loss function, one can
formulate boosting algorithms only for $K - 1$ classes, by using one class as the base class.

2. At each boosting iteration, adaptively select the base class according to the training loss (3). [8]
suggested an exhaustive search strategy.

[8] derived the derivatives of (3) under the sum-to-zero constraint. Without loss of generality, we can
assume that class 0 is the base class. For any $k \neq 0$,

$$\frac{\partial L_i}{\partial F_{i,k}} = (r_{i,0} - p_{i,0}) - (r_{i,k} - p_{i,k}), \qquad (10)$$

$$\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}(1 - p_{i,0}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,0} p_{i,k}. \qquad (11)$$

[8] combined the idea of abc-boost with mart to develop abc-mart, which achieved good performance in
multi-class classification. More recently, [9] developed abc-logitboost by combining abc-boost with robust
logitboost.
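
The derivatives (10) and (11) can be checked against finite differences of the loss once the base-class function value is tied to the others by the sum-to-zero constraint. The following sketch does this for one sample with K = 3. All names (`softmax`, `abc_derivatives`, `constrained_loss`) and the toy values are our own; the general base class `b` is a direct relabeling of the case b = 0 treated in the text.

```python
import math

def softmax(F):
    """Class probabilities from function values, as in (1)."""
    m = max(F)
    e = [math.exp(f - m) for f in F]
    total = sum(e)
    return [v / total for v in e]

def abc_derivatives(r, p, b, k):
    """First and second derivatives of L_i w.r.t. F_{i,k} with base class b."""
    if k == b:
        raise ValueError("k must differ from the base class b")
    grad = (r[b] - p[b]) - (r[k] - p[k])
    hess = p[b] * (1 - p[b]) + p[k] * (1 - p[k]) + 2 * p[b] * p[k]
    return grad, hess

def constrained_loss(F, y, b):
    """L_i after resetting F[b] so that all K values sum to zero."""
    F = list(F)
    F[b] = -(sum(F) - F[b])
    return -math.log(softmax(F)[y])

# One sample with K = 3, label y = 1, base class b = 0; differentiate w.r.t. class k = 1.
F, y, b, k, h = [0.2, -0.5, 0.3], 1, 0, 1, 1e-4
p = softmax(F)
r = [1.0 if c == y else 0.0 for c in range(3)]
grad, hess = abc_derivatives(r, p, b, k)

up, down = list(F), list(F)
up[k] += h
down[k] -= h
num_grad = (constrained_loss(up, y, b) - constrained_loss(down, y, b)) / (2 * h)
num_hess = (constrained_loss(up, y, b) - 2 * constrained_loss(F, y, b)
            + constrained_loss(down, y, b)) / h ** 2
```

The analytic and numerical values agree up to the finite-difference error, which is the behavior one would expect if (10) and (11) are the derivatives of (3) under the constraint.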



2.1 ABC-LogitBoost and ABC-Mart 

Alg. 2 presents abc-logitboost, using the derivatives in (10) and (11) and the same exhaustive search strategy
proposed in [8]. Compared to Alg. 1, abc-logitboost differs from (robust) logitboost in that they use different
derivatives and abc-logitboost needs an additional loop to select the base class at each boosting iteration.

Algorithm 2 Abc-logitboost using the exhaustive search strategy for the base class, as suggested in [8]. The
vector B stores the base class numbers.

1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K - 1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $b = 0$ to $K - 1$, Do
4:     For $k = 0$ to $K - 1$, $k \neq b$, Do
5:       $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k}), \mathbf{x}_i\}_{i=1}^{N}$ with
         weights $p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b} p_{i,k}$, in Sec. 1.2.1
6:       $\beta_{j,k,m} = \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} -(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b} p_{i,k}}$
7:       $g_{i,k,b} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
8:     End
9:     $g_{i,b,b} = -\sum_{k \neq b} g_{i,k,b}$
10:    $q_{i,k} = \exp(g_{i,k,b}) / \sum_{s=0}^{K-1} \exp(g_{i,s,b})$
11:    $L^{(b)} = -\sum_{i=1}^{N} \sum_{k=0}^{K-1} r_{i,k} \log(q_{i,k})$
12:  End
13:  $B(m) = \arg\min_{b} L^{(b)}$
14:  $F_{i,k} = g_{i,k,B(m)}$
15:  $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
16: End

Again, abc-logitboost differs from abc-mart only in the tree-split procedure (Line 5 in Alg. 2).
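2.2 Why Does the Choice of Base Class Matter?

[9] used the Hessian matrix to demonstrate why the choice of the base class matters.

The choice of the base class matters because of the diagonal approximation; that is, fitting a regression
tree for each class at each boosting iteration. To see this, we can take a look at the Hessian matrix, for
K = 3. Using the original logitboost/mart derivatives (4), and writing $p_k$ for $p_{i,k}$, the determinant of the Hessian matrix is

$$\begin{vmatrix}
\frac{\partial^2 L_i}{\partial F_{i,0}^2} & \frac{\partial^2 L_i}{\partial F_{i,0} \partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,0} \partial F_{i,2}} \\
\frac{\partial^2 L_i}{\partial F_{i,1} \partial F_{i,0}} & \frac{\partial^2 L_i}{\partial F_{i,1}^2} & \frac{\partial^2 L_i}{\partial F_{i,1} \partial F_{i,2}} \\
\frac{\partial^2 L_i}{\partial F_{i,2} \partial F_{i,0}} & \frac{\partial^2 L_i}{\partial F_{i,2} \partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,2}^2}
\end{vmatrix}
=
\begin{vmatrix}
p_0(1 - p_0) & -p_0 p_1 & -p_0 p_2 \\
-p_1 p_0 & p_1(1 - p_1) & -p_1 p_2 \\
-p_2 p_0 & -p_2 p_1 & p_2(1 - p_2)
\end{vmatrix}
= 0,$$

as expected, because there are only $K - 1$ degrees of freedom. A simple fix is to use the diagonal
approximation [6, 5]. In fact, when trees are used as the weak learner, it seems one must use the diagonal
approximation.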

Now, consider the derivatives (10) and (11) used in abc-mart and abc-logitboost. This time, when K = 3
and k = 0 is the base class, we only have a 2 by 2 Hessian matrix, whose determinant is

$$\begin{vmatrix}
\frac{\partial^2 L_i}{\partial F_{i,1}^2} & \frac{\partial^2 L_i}{\partial F_{i,1} \partial F_{i,2}} \\
\frac{\partial^2 L_i}{\partial F_{i,2} \partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,2}^2}
\end{vmatrix}
=
\begin{vmatrix}
p_0(1 - p_0) + p_1(1 - p_1) + 2 p_0 p_1 & p_0 - p_0^2 + p_0 p_1 + p_0 p_2 - p_1 p_2 \\
p_0 - p_0^2 + p_0 p_1 + p_0 p_2 - p_1 p_2 & p_0(1 - p_0) + p_2(1 - p_2) + 2 p_0 p_2
\end{vmatrix}$$

$$= p_0 p_1 + p_0 p_2 + p_1 p_2 - p_0 p_1^2 - p_0 p_2^2 - p_1 p_2^2 - p_2 p_1^2 - p_1 p_0^2 - p_2 p_0^2 + 6 p_0 p_1 p_2,$$

which is non-zero and is in fact independent of the choice of the base class (even though we assume
k = 0 as the base in this example). In other words, the choice of the base class would not matter if the full
Hessian is used.

However, because we will have to use diagonal approximation in order to construct trees at each iteration,
the choice of the base class will matter.
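
Both determinant claims can be verified numerically for K = 3. The sketch below builds the full 3 by 3 Hessian from (4) and the reduced 2 by 2 Hessian for each base class by the chain rule under the sum-to-zero constraint. The function names, the toy probabilities, and the chain-rule formula in `abc_hessian` are our own derivation rather than something stated in the text. The same holds for the closed form $9 p_0 p_1 p_2$ used in the check, which follows from the expanded determinant when $p_0 + p_1 + p_2 = 1$.

```python
def full_hessian(p):
    """K x K Hessian of L_i: p_k (1 - p_k) on the diagonal, -p_j p_k elsewhere."""
    K = len(p)
    return [[p[j] * (1 - p[j]) if j == k else -p[j] * p[k] for k in range(K)]
            for j in range(K)]

def abc_hessian(p, b):
    """(K-1) x (K-1) Hessian when F_b = -(sum of the other F_k)."""
    H = full_hessian(p)
    free = [k for k in range(len(p)) if k != b]
    return [[H[j][k] - H[j][b] - H[b][k] + H[b][b] for k in free] for j in free]

def det(m):
    """Determinant by cofactor expansion; adequate for these tiny matrices."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** c * m[0][c] * det([row[:c] + row[c + 1:] for row in m[1:]])
               for c in range(len(m)))

p = [0.5, 0.3, 0.2]
full_det = det(full_hessian(p))                          # zero: singular
abc_dets = [det(abc_hessian(p, b)) for b in range(3)]    # identical for every b
# Product of the two diagonal entries, i.e. what a diagonal approximation retains.
diag_products = [abc_hessian(p, b)[0][0] * abc_hessian(p, b)[1][1] for b in range(3)]
```

For these probabilities the full determinant vanishes and the reduced determinant is the same for all three base classes, while the products of the diagonal entries differ from one base class to another. This is consistent with the argument that the base class only matters once the off-diagonal terms are dropped.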



2.3 Datasets Used for Testing Fast ABC-Boost 

We will test fast abc-boost using a subset of the datasets in [9], as listed in Table 1. Because the
computational cost of abc-boost is not a concern for small datasets, this study focuses on fairly large datasets
(Covertype and Poker) as well as datasets of moderate size (Mnist10k and M-Image).



Table 1: Datasets

dataset          K     # training    # test     # features
Covertype290k    7     290506        290506     54
Poker525k        10    525010        500000     25
Poker275k        10    275010        500000     25
Mnist10k         10    10000         60000      784
M-Image          10    12000         50000      784



2.4 Review the Detailed Experiment Results of ABC-Boost on Mnist10k and M-Image

For these two datasets, [9] experimented with every combination of J ∈ {4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 30, 40, 50}
and ν ∈ {0.04, 0.06, 0.08, 0.1}. The four boosting algorithms were trained till the training loss (3) was close
to the machine accuracy, to exhaust the capacity of the learners, for reliable comparisons, up to M = 10000
iterations. Since no obvious overfitting was observed, the test mis-classification errors at the last iterations
were reported.

Table 2 and Table 3 present the test mis-classification errors, which verify the consistent improvements
of (A) abc-logitboost over (robust) logitboost, (B) abc-logitboost over abc-mart, (C) (robust) logitboost over
mart, and (D) abc-mart over mart. The tables also verify that the performances are not too sensitive to the
parameters (J and ν).

Table 2: Mnist10k. Upper table: The test mis-classification errors of mart and abc-mart. Bottom
table: The test errors of logitboost and abc-logitboost. In each cell, the first number is for mart (or
logitboost) and the second number is for abc-mart (or abc-logitboost).

mart / abc-mart
           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      3356 3060     3329 3019     3318 2855     3326 2794
J = 6      3185 2760     3093 2626     3129 2656     3217 2590
J = 8      3049 2558     3054 2555     3054 2534     3035 2577
J = 10     3020 2547     2973 2521     2990 2520     2978 2506
J = 12     2927 2498     2917 2457     2945 2488     2907 2490
J = 14     2925 2487     2901 2471     2877 2470     2884 2454
J = 16     2899 2478     2893 2452     2873 2465     2860 2451
J = 18     2857 2469     2880 2460     2870 2437     2855 2454
J = 20     2833 2441     2834 2448     2834 2444     2815 2440
J = 24     2840 2447     2827 2431     2801 2427     2784 2455
J = 30     2826 2457     2822 2443     2828 2470     2807 2450
J = 40     2837 2482     2809 2440     2836 2447     2782 2506
J = 50     2813 2502     2826 2459     2824 2469     2786 2499

logitboost / abc-logit
           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      2936 2630     2970 2600     2980 2535     3017 2522
J = 6      2710 2263     2693 2252     2710 2226     2711 2223
J = 8      2599 2159     2619 2138     2589 2120     2597 2143
J = 10     2553 2122     2527 2118     2516 2091     2500 2097
J = 12     2472 2084     2468 2090     2468 2090     2464 2095
J = 14     2451 2083     2420 2094     2432 2063     2419 2050
J = 16     2424 2111     2437 2114     2393 2097     2395 2082
J = 18     2399 2088     2402 2087     2389 2088     2380 2097
J = 20     2388 2128     2414 2112     2411 2095     2381 2102
J = 24     2442 2174     2415 2147     2417 2129     2419 2138
J = 30     2468 2235     2434 2237     2423 2221     2449 2177
J = 40     2551 2310     2509 2284     2518 2257     2531 2260
J = 50     2612 2353     2622 2359     2579 2332     2570 2341
Table 3: M-Image. Upper table: The test mis-classification errors of mart and abc-mart. Bottom
table: The test errors of logitboost and abc-logitboost. In each cell, the first number is for mart (or
logitboost) and the second number is for abc-mart (or abc-logitboost).

mart / abc-mart
           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      6536 5867     6511 5813     6496 5774     6449 5756
J = 6      6203 5471     6174 5414     6176 5394     6139 5370
J = 8      6095 5320     6081 5251     6132 5141     6220 5181
J = 10     6076 5138     6104 5100     6154 5086     6332 4983
J = 12     6036 4963     6086 4956     6104 4926     6117 4867
J = 14     5922 4885     6037 4866     6018 4789     5993 4839
J = 16     5914 4847     5937 4806     5940 4797     5883 4766
J = 18     5955 4835     5886 4778     5896 4733     5814 4730
J = 20     5870 4749     5847 4722     5829 4707     5821 4727
J = 24     5816 4725     5766 4659     5785 4662     5752 4625
J = 30     5729 4649     5738 4629     5724 4626     5702 4654
J = 40     5752 4619     5699 4636     5672 4597     5676 4660
J = 50     5760 4674     5731 4667     5723 4659     5725 4649

logitboost / abc-logit
           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      5837 5539     5852 5480     5834 5408     5802 5430
J = 6      5473 5076     5471 4925     5457 4950     5437 4919
J = 8      5294 4756     5285 4748     5193 4678     5187 4670
J = 10     5141 4597     5120 4572     5052 4524     5049 4537
J = 12     5013 4432     5016 4455     4987 4416     4961 4389
J = 14     4914 4378     4922 4338     4906 4356     4895 4299
J = 16     4863 4317     4842 4307     4816 4279     4806 4314
J = 18     4762 4301     4740 4255     4754 4230     4751 4287
J = 20     4714 4251     4734 4231     4693 4214     4703 4268
J = 24     4676 4242     4610 4298     4663 4226     4638 4250
J = 30     4653 4351     4662 4307     4633 4311     4643 4286
J = 40     4713 4434     4724 4426     4760 4439     4768 4388
J = 50     4763 4502     4795 4534     4792 4487     4799 4479



3 Fast ABC-Boost 



Recall that, in abc-boost, the base class must be identified at each boosting iteration. The exhaustive search
strategy used in [8, 9] is obviously very expensive. In this paper, our main contribution is a proposal for
speeding up abc-boost by introducing Gaps when selecting the base class. Again, we illustrate our strategy
using abc-mart and abc-logitboost, which are the only two implementations of abc-boost so far.

Assuming M boosting iterations, the computation cost of mart and logitboost is O(KM). However, the
computation cost of abc-mart and abc-logitboost is O(K(K - 1)M), which can be prohibitive.

The reason we need to select the base class is that we have to use the diagonal approximation in
order to fit a regression tree separately for each class at every boosting iteration. Based on this insight, we really
do not have to re-compute the base class at every iteration. Instead, we only compute the base class once
every G steps, where G is the gap and G = 1 means we select the base class at every iteration.
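
The control flow of this heuristic can be sketched as follows. This is only a skeleton under our own assumptions: `search_base_class` and `boost_step` are hypothetical callbacks standing in for the exhaustive search and for one boosting iteration with a fixed base class, and we assume the search is run at iterations 1, G + 1, 2G + 1, and so on, which the text does not specify. In Alg. 2 the search already performs the boosting step for the winning class; the sketch separates the two only for readability.

```python
def train_with_gaps(M, G, search_base_class, boost_step):
    """Run M boosting iterations, re-selecting the base class once every G iterations.

    search_base_class(m) -> int : hypothetical callback doing the exhaustive search.
    boost_step(m, b)            : hypothetical callback doing one iteration with base b.
    Returns the list B of base classes used at iterations 1..M.
    """
    B = []
    b = None
    for m in range(1, M + 1):
        if (m - 1) % G == 0:        # search iterations: 1, G + 1, 2G + 1, ...
            b = search_base_class(m)
        boost_step(m, b)
        B.append(b)
    return B

# Dummy callbacks that only record when the expensive search is invoked.
searched_at = []

def dummy_search(m):
    searched_at.append(m)
    return m % 3                    # arbitrary stand-in for the selected class

B = train_with_gaps(10, 4, dummy_search, lambda m, b: None)
```

With M = 10 and G = 4 the dummy search is invoked three times, and each selected class is reused until the next search. Setting G = 1 calls the search at every iteration, which corresponds to the original abc-boost.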

After introducing gaps, the computation cost of fast abc-boost is reduced to O(K(K - 1) M/G + (M - M/G)(K - 1)).
One can verify that when G = (K - 1), the cost of fast abc-boost is at most twice the cost of logitboost.
As we increase G further, the additional computational overhead of fast abc-boost further diminishes.
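
To make the cost comparison concrete, the sketch below counts regression trees under the cost expression above, treating the number of fitted trees as the unit of cost. The function name `tree_count` and the choice of K = 10 and M = 5000 are illustrative assumptions; constant factors and the cost of evaluating the loss are ignored.

```python
def tree_count(K, M, G=None):
    """Approximate number of regression trees fitted over M iterations.

    G is None : mart / logitboost, K trees per iteration.
    otherwise : fast abc-boost with gap G, i.e. M / G search iterations with
                K (K - 1) trees each and K - 1 trees in every other iteration.
    """
    if G is None:
        return K * M
    searches = M / G
    return K * (K - 1) * searches + (M - searches) * (K - 1)

K, M = 10, 5000
baseline = tree_count(K, M)                      # logitboost
ratio_original = tree_count(K, M, 1) / baseline  # original abc-boost, G = 1
ratio_gap = tree_count(K, M, K - 1) / baseline   # G = K - 1
```

Under this count, G = 1 costs K - 1 times as much as logitboost, while G = K - 1 gives a ratio of 2(K - 1)/K, which is below 2 for every K.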

The parameter G can be viewed as a new tuning parameter. Our experiments (in the following
subsections) illustrate that there is no obvious loss of test accuracy when G < 100 on the large datasets,
or when G < 20 ~ 50 on the moderate datasets.

3.1 Experiments on Large Datasets, Poker525k, Poker275k, and Covertype290k 

As presented in [9], on the Poker dataset, abc-boost achieved very remarkable improvements over mart and
logitboost, especially when the number of boosting iterations was not too large. In fact, even at M = 5000
iterations, the mis-classification error of mart (or (robust) logitboost) is 3 times (or 1.5 times) as large as the
error of abc-mart (or abc-logitboost); see the rightmost panel of Figure 1.
Figure 1: Poker525k. Left panel: test mis-classification errors of abc-mart (with G =
1, 5, 10, 20, 50, 100, 500, 1000) and mart, for all boosting iterations up to M = 5000 steps. We only label
the curves which are distinguishable (in this case G = 500 and 1000). Middle panel: test mis-classification
errors of abc-logitboost and (robust) logitboost. Note that, at M = 5000, the test error of abc-logitboost is
significantly smaller than the test error of logitboost, even though, due to the scaling issue, the difference
may be less obvious in the figure. Right panel: the ratios of test errors, i.e., mart over abc-mart and
logitboost over abc-logitboost, at the last (i.e., M = 5000) boosting iteration. The two dashed horizontal lines
represent the test error ratios at G = 1 (i.e., the original abc-boost). Note that a ratio of 1.5 (or even 3)
should be considered extremely large for classification tasks.
For all datasets, we experiment with G = 1 (i.e., the original abc-boost), 5, 10, 20, 50, 100, 500, 1000.
As shown in Figure 1, using fast abc-boost with G < 100, there is no obvious loss of test accuracies on
Poker525k. In fact, using abc-mart, even with G = 1000, there is only very little loss of accuracy.

Note that it is possible for fast abc-boost to achieve smaller test errors than abc-boost; for example, the
ratios of test errors in the right panel of Figure 1 may be below 1.0. This interesting phenomenon is not
surprising. After all, G can be viewed as a tuning parameter, and using G > 1 may have some regularization
effect because the resulting procedure is less greedy.

Figure 2 presents the test error results on Poker275k, which are very similar to the results on Poker525k.

Figure 2: Poker275k

Figure 3 presents the test error results on Covertype290k. For this dataset, even with G = 1000, we
notice essentially no loss of test accuracies.

Figure 3: Covertype290k

3.2 Experiments on Moderate Datasets, M-Image and Mnist10k

The situation is somewhat different on datasets that are not too large. Recall, for these two datasets, we
terminate the training if the training loss (3) is close to the machine accuracy, up to M = 10000 iterations.

Figure 4 and Figure 5 show that, on M-Image and Mnist10k, using fast abc-boost with G > 50 can result
in non-negligible loss of test accuracies compared to using G = 1. When G is too large, e.g., G = 1000, it
is possible that fast abc-boost may produce even larger test errors than mart or logitboost.

Figure 4 and Figure 5 report the test errors for J = 20 and two shrinkages, ν = 0.06, 0.1. It seems that,
at the same G, using smaller ν produces slightly better results.

Figure 4: M-Image. See the caption of Figure 1 for explanations.

Figure 5: Mnist10k. See the caption of Figure 1 for explanations.



The above experiments always use J = 20, which seems to be a reasonable number of terminal tree
nodes for large or moderate datasets. Nevertheless, it would be interesting to experiment with other J
values. Figure 6 presents the results on the Mnist10k dataset, for J = 6, 10, 16, 20, 24, 30.

When J is small (e.g., J = 6), using G as large as 100 results in almost no loss of test accuracies.
However, when J is large (e.g., J = 30), even G = 50 may produce obviously less accurate results
compared to G = 1.

Figure 6: Mnist10k

4 Conclusion



This study proposes fast abc-boost to significantly improve the training speed of abc-boost, which suffered
from serious problems of computational efficiency. Abc-boost is a new line of boosting algorithms for
improving multi-class classification, which was implemented as abc-mart and abc-logitboost in prior studies.
Abc-boost requires that a base class be identified at each boosting iteration. The computation of the
base class was based on an expensive exhaustive search strategy in prior studies.

With fast abc-boost, we only need to update the choice of the base class once for every G iterations, 
where G can be viewed as Gaps and used as an additional tuning parameter. Our experiments on fairly large 
datasets show that the test errors are not sensitive to the choice of G, even with G = 100 or 1000. For 
datasets of moderate size, our experiments show that, when G < 20 ~ 50, there would be no obvious loss 
of test accuracies compared to the original abc-boost algorithms (i.e., G = 1). 

These preliminary results are very encouraging. We expect fast abc-boost will be a practical tool for 
accurate multi-class classification. 